Previously, we learned about working with strings, factors, and dates.
Today we’ll learn about how to visualize data. Some of today’s examples come from Healy (2017). By the end of this session, you should be able to:
ggplot() call
Restart R (Ctrl/Cmd+Shift+F10), clear the console (Ctrl/Cmd+L), and clear your workspace.
Is your project still open? If not, click on the project icon to load it. (Don’t create a new one.)
We’ll need:
library(tidyverse)
library(gapminder)
Four datasets with nearly identical descriptive stats.
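One well-known instance of four datasets with nearly identical descriptive statistics is Anscombe's quartet, which ships with base R as the `anscombe` data frame. Assuming that is the kind of example meant here, a minimal sketch of the comparison follows (the name `quartet_stats` is ours, chosen for illustration):

```r
# Anscombe's quartet is built into R as `anscombe`,
# with columns x1..x4 and y1..y4.
quartet_stats <- data.frame(
  dataset = 1:4,
  mean_x  = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y  = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  cor_xy  = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# Round for display; the four rows come out nearly identical.
print(round(quartet_stats, 2))
```

Plotting each pair, for example `plot(anscombe$x1, anscombe$y1)`, shows how different the four datasets look despite the matching summaries.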
By Alberto Cairo. Available as the R package datasauRus.
Wickham and Grolemund (2017) suggest the following EDA questions:
If you have categorical data, start with a bar chart to summarize the distribution of values.
If you have continuous data, try a histogram.
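Both chart types are drawings of a count table: one row per category for a bar chart, one row per bin for a histogram. As a rough sketch, assuming the `diamonds` data that ships with ggplot2 and a bin width of 0.5 carat picked only for illustration, the tables behind the two charts could be computed like this:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset and cut_width()

# Categorical variable: count the rows in each category.
# These counts are the bar heights of a bar chart of `cut`.
cut_counts <- diamonds %>%
  count(cut)

# Continuous variable: slice the range into equal-width bins, then count.
# These counts are the bar heights of a histogram of `carat`.
carat_counts <- diamonds %>%
  count(bin = cut_width(carat, 0.5))

print(cut_counts)
print(carat_counts)
```

Changing the bin width changes the shape of the histogram, so it is worth trying a few values.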
Sometimes the fastest plot is a base R plot.
hist(diamonds$carat)
hist(diamonds$carat,
     main = "Histogram of carat size",
     xlab = "Carat size",
     border = "black",
     col = "red",
     las = 1,
     breaks = 10)
The ggplot() way
ggplot2 is a tidyverse package by Hadley Wickham that implements Wilkinson’s Grammar of Graphics, a helpful approach for thinking about the components of an effective visualization of data.
The ggplot way (Healy 2017)
data
This line tells ggplot() which dataset to use and produces a blank plot.
ggplot(data = gapminder)
For convenience, we’re going to assign each step to an object called p. You can call it whatever you want. The key idea is that we create a base plot p and add to it in each step. So here, p is just an empty plot. If you want to see the result, you have to print p.
p <- ggplot(data = gapminder)
p
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
data and mapping
The first two ggplot() arguments are data and mapping. We’ll drop the data= and mapping= labels from here on.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p <- ggplot(gapminder,
            aes(x = gdpPercap,
                y = lifeExp)) # same thing
The aes() function
The mapping argument calls for aesthetic mappings of variables to plot elements.
With aes() you tell ggplot() which variable from the dataset should map to the x-axis, and which should map to the y-axis.
In gapminder: gdpPercap goes to the x-axis, while lifeExp goes to the y-axis.
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
So far, not much. We’ve just told ggplot to use the gapminder dataset and to map two variables, but we have not specified the type of plot we want.
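That p holds a dataset and a mapping but no plot type yet can be checked by inspecting the plot object. The sketch below reads the `mapping` and `layers` fields with `$`; that field access is our assumption about how ggplot objects are stored, not something the slides rely on, and it may behave differently across ggplot2 versions.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# The aesthetic mapping is recorded on the plot object.
mapped_aesthetics <- names(p$mapping)
print(mapped_aesthetics)

# No geom has been added yet, so the plot has no layers to draw.
n_layers <- length(p$layers)
print(n_layers)
```

A plot with zero layers still prints axes and a background, because the data and mapping are enough to set up the coordinate system.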
p
geom()
Use the + sign to add the next layer to this plot: a geom()! In this example, we add geom_point(), the points geom.
p + geom_point() # not assigning to p on purpose
geom_point()
Check out the help file for your geom to learn more about its use, or review the reference material on tidyverse.org: http://ggplot2.tidyverse.org/reference/geom_point.html
?geom_point # learn about arguments
geom_smooth() calculates a smoothed line and shades a confidence band around it, based on the standard error. Check out the arguments to geom_smooth() to tinker with the smoothing function used.
p + geom_smooth()
p + geom_point() + geom_smooth(method = "lm") # change method
p + geom_point() + geom_smooth() + scale_x_log10()
p + geom_point() + geom_smooth() +
  scale_x_log10(labels = scales::dollar)
p <- p + geom_point(color = "purple",
                    alpha = 0.3, # color transparency
                    size = 2) +
  geom_smooth(method = "loess",
              color = "#FCF221") + # htmlcolorcodes.com
  scale_x_log10(labels = scales::dollar)
p
p <- p + labs(x = "GDP Per Capita",
              y = "Life Expectancy in Years",
              title = "Economic Growth and Life Expectancy",
              subtitle = "Data points are country-years",
              caption = "Source: Gapminder.")
p
p + theme_minimal()
For instance, maybe instead of making all the points “purple”, we want to color the points by values in the variable continent.
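The difference between the two approaches is whether color is set to a fixed value or mapped to a variable. Here is a small side-by-side sketch of that distinction; the object names `base_plot`, `set_plot`, and `mapped_plot` are ours, chosen for illustration.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  scale_x_log10()

# Setting: color is a fixed value given outside aes().
# Every point is purple and no legend is drawn.
set_plot <- base_plot + geom_point(color = "purple")

# Mapping: color is tied to a variable inside aes().
# Each continent gets its own color and ggplot adds a legend.
mapped_plot <- base_plot + geom_point(aes(color = continent))

set_plot
mapped_plot
```

A mapping placed in the ggplot() call instead of in geom_point() is inherited by every geom added afterwards, not just the points.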
p <- ggplot(gapminder,
            aes(x = gdpPercap,
                y = lifeExp,
                color = continent))
p + geom_point() +
  geom_smooth(method = 'loess') +
  scale_x_log10()
Map shape to point values
ggplot(gapminder,
       aes(x = gdpPercap,
           y = lifeExp,
           shape = continent)) + # changed from color
  geom_point() +
  geom_smooth(method = 'loess') +
  scale_x_log10()
Map fill to se
p <- ggplot(gapminder,
            aes(x = gdpPercap,
                y = lifeExp,
                color = continent,
                fill = continent))
p + geom_point() +
  geom_smooth(method = 'loess') +
  scale_x_log10()
p <- ggplot(gapminder,
            aes(x = gdpPercap,
                y = lifeExp))
p + geom_point(aes(color = continent),
               alpha = 0.6,
               size = 1) +
  geom_smooth(method = 'loess') + # just 1 line
  scale_x_log10()
The group trends are hard to see. Let’s try faceting by continent to make a series of “small multiples”. First we need to get back to our basic plot defining point and line color:
p <- p + geom_point(color = "purple",
                    alpha = 0.3,
                    size = 2) +
  geom_smooth(method = "loess",
              color = "#FCF221") +
  scale_x_log10(labels = scales::dollar)
facet_wrap()
p + facet_wrap(~ continent)
p + facet_wrap(~ continent, ncol = 5) +
  labs(x = "GDP Per Capita",
       y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy on Five Continents",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 6))
Healy, Kieran. 2017. Data Visualization for Social Science. http://socviz.co/.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.